{
  "cells": [
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "Tce3stUlHN0L"
      },
      "source": [
        "##### Copyright 2020 The TensorFlow IO Authors."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "cellView": "form",
        "id": "tuOe1ymfHZPu"
      },
      "outputs": [],
      "source": [
        "#@title Licensed under the Apache License, Version 2.0 (the \"License\");\n",
        "# you may not use this file except in compliance with the License.\n",
        "# You may obtain a copy of the License at\n",
        "#\n",
        "# https://www.apache.org/licenses/LICENSE-2.0\n",
        "#\n",
        "# Unless required by applicable law or agreed to in writing, software\n",
        "# distributed under the License is distributed on an \"AS IS\" BASIS,\n",
        "# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n",
        "# See the License for the specific language governing permissions and\n",
        "# limitations under the License."
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "qFdPvlXBOdUN"
      },
      "source": [
        "# 从 TensorFlow IO 中读取 PostgreSQL 数据库"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "MfBg1C5NB3X0"
      },
      "source": [
        "<table class=\"tfo-notebook-buttons\" align=\"left\">\n",
        "  <td><a target=\"_blank\" href=\"https://tensorflow.google.cn/io/tutorials/postgresql\"><img src=\"https://tensorflow.google.cn/images/tf_logo_32px.png\">在 TensorFlow.org 上查看 </a></td>\n",
        "  <td><a target=\"_blank\" href=\"https://colab.research.google.com/github/tensorflow/docs-l10n/blob/master/site/zh-cn/io/tutorials/postgresql.ipynb\"><img src=\"https://tensorflow.google.cn/images/colab_logo_32px.png\">在 Google Colab 中运行 </a></td>\n",
        "  <td><a target=\"_blank\" href=\"https://github.com/tensorflow/docs-l10n/blob/master/site/zh-cn/io/tutorials/postgresql.ipynb\"><img src=\"https://tensorflow.google.cn/images/GitHub-Mark-32px.png\">在 GitHub 中查看源代码</a></td>\n",
        "      <td><a href=\"https://storage.googleapis.com/tensorflow_docs/docs-l10n/site/zh-cn/io/tutorials/postgresql.ipynb\"><img src=\"https://tensorflow.google.cn/images/download_logo_32px.png\">下载笔记本</a></td>\n",
        "</table>"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "xHxb-dlhMIzW"
      },
      "source": [
        "## 概述\n",
        "\n",
        "本教程介绍了如何从 PostgreSQL 数据库服务器创建 `tf.data.Dataset`，以便可以将创建的 `Dataset` 传递给 `tf.keras` 进行训练或推理。\n",
        "\n",
        "SQL 数据库是数据科学家的重要数据来源。[PostgreSQL](https://www.postgresql.org) 是最流行的开源 SQL 数据库之一，在企业中广泛用于全面存储关键数据和事务数据。直接从 PostgreSQL 数据库服务器创建 `Dataset` 并将 `Dataset` 传递给 `tf.keras` 进行训练或推理，可以极大地简化数据流水线，并帮助数据科学家专注于构建机器学习模型。"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "MUXex9ctTuDB"
      },
      "source": [
        "## 设置和用法"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "upgCc3gXybsA"
      },
      "source": [
        "### 安装所需的 tensorflow-io 软件包，然后重新启动运行时"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "uUDYyMZRfkX4"
      },
      "outputs": [],
      "source": [
        "try:\n",
        "  %tensorflow_version 2.x\n",
        "except Exception:\n",
        "  pass\n",
        "\n",
        "!pip install tensorflow-io"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "yZmI7l_GykcW"
      },
      "source": [
        "### 安装和设置 PostgreSQL（可选）\n",
        "\n",
        "**警告：此笔记本只适合在 Google Colab 中运行**。*它会在系统上安装软件包，并且需要 sudo 访问。如果要在本地 Jupyter 笔记本中运行，请谨慎操作。*\n",
        "\n",
        "为了演示在 Google Colab 上的用法，您将安装 PostgreSQL 服务器。此外，还需要密码和一个空数据库。\n",
        "\n",
        "如果您不在 Google Colab 上运行此笔记本，或者希望使用现有数据库，请跳过以下设置并前进到下一部分。"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "YUj0878jPyz7"
      },
      "outputs": [],
      "source": [
        "# Install postgresql server\n",
        "!sudo apt-get -y -qq update\n",
        "!sudo apt-get -y -qq install postgresql\n",
        "!sudo service postgresql start\n",
        "\n",
        "# Setup a password `postgres` for username `postgres`\n",
        "!sudo -u postgres psql -U postgres -c \"ALTER USER postgres PASSWORD 'postgres';\"\n",
        "\n",
        "# Setup a database with name `tfio_demo` to be used\n",
        "!sudo -u postgres psql -U postgres -c 'DROP DATABASE IF EXISTS tfio_demo;'\n",
        "!sudo -u postgres psql -U postgres -c 'CREATE DATABASE tfio_demo;'"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "e4_59-RLMRgT"
      },
      "source": [
        "### 设置必要的环境变量\n",
        "\n",
        "以下环境变量基于上一部分中的 PostgreSQL 设置。如果您使用其他设置或使用现有数据库，则应相应地更改这些设置：\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "0dRotqDMswcK"
      },
      "outputs": [],
      "source": [
        "%env TFIO_DEMO_DATABASE_NAME=tfio_demo\n",
        "%env TFIO_DEMO_DATABASE_HOST=localhost\n",
        "%env TFIO_DEMO_DATABASE_PORT=5432\n",
        "%env TFIO_DEMO_DATABASE_USER=postgres\n",
        "%env TFIO_DEMO_DATABASE_PASS=postgres"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "9reCVv0mE_9O"
      },
      "source": [
        "### 在 PostgreSQL 服务器中准备数据\n",
        "\n",
        "出于演示目的，本教程将创建一个数据库，并使用一些数据填充该数据库。本教程中使用的数据来自 [UCI Machine Learning Repository](http://archive.ics.uci.edu/ml) 中提供的 [Air Quality Data Set](https://archive.ics.uci.edu/ml/datasets/Air+Quality)。\n",
        "\n",
        "下面是 Air Quality Data Set 的一个子集的预览：\n",
        "\n",
        "日期 | 时间 | CO(GT) | PT08.S1(CO) | NMHC(GT) | C6H6(GT) | PT08.S2(NMHC) | NOx(GT) | PT08.S3(NOx) | NO2(GT) | PT08.S4(NO2) | PT08.S5(O3) | T | RH | AH\n",
        "--- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | ---\n",
        "10/03/2004 | 18.00.00 | 2.6 | 1360 | 150 | 11.9 | 1046 | 166 | 1056 | 113 | 1692 | 1268 | 13.6 | 48.9 | 0.7578\n",
        "10/03/2004 | 19.00.00 | 2 | 1292 | 112 | 9.4 | 955 | 103 | 1174 | 92 | 1559 | 972 | 13.3 | 47.7 | 0.7255\n",
        "10/03/2004 | 20.00.00 | 2.2 | 1402 | 88 | 9.0 | 939 | 131 | 1140 | 114 | 1555 | 1074 | 11.9 | 54.0 | 0.7502\n",
        "10/03/2004 | 21.00.00 | 2.2 | 1376 | 80 | 9.2 | 948 | 172 | 1092 | 122 | 1584 | 1203 | 11.0 | 60.0 | 0.7867\n",
        "10/03/2004 | 22.00.00 | 1.6 | 1272 | 51 | 6.5 | 836 | 131 | 1205 | 116 | 1490 | 1110 | 11.2 | 59.6 | 0.7888\n",
        "\n",
        "[参考文献](#references)部分提供了有关 Air Quality Data Set 和 UCI Machine Learning Repository 的更多信息。\n",
        "\n",
        "为了帮助简化数据准备工作，我们准备了 sql 版本的 Air Quality Data Set，并将其以 [AirQualityUCI.sql](https://github.com/tensorflow/io/blob/master/docs/tutorials/postgresql/AirQualityUCI.sql) 形式提供。\n",
        "\n",
        "创建表的语句如下：\n",
        "\n",
        "```\n",
        "CREATE TABLE AirQualityUCI (\n",
        "  Date DATE,\n",
        "  Time TIME,\n",
        "  CO REAL,\n",
        "  PT08S1 INT,\n",
        "  NMHC REAL,\n",
        "  C6H6 REAL,\n",
        "  PT08S2 INT,\n",
        "  NOx REAL,\n",
        "  PT08S3 INT,\n",
        "  NO2 REAL,\n",
        "  PT08S4 INT,\n",
        "  PT08S5 INT,\n",
        "  T REAL,\n",
        "  RH REAL,\n",
        "  AH REAL\n",
        ");\n",
        "```\n",
        "\n",
        "用于在数据库中创建表并填充数据的完整命令如下："
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "W1eVidg3JrPV"
      },
      "outputs": [],
      "source": [
        "!curl -s -OL https://github.com/tensorflow/io/raw/master/docs/tutorials/postgresql/AirQualityUCI.sql\n",
        "\n",
        "!PGPASSWORD=$TFIO_DEMO_DATABASE_PASS psql -q -h $TFIO_DEMO_DATABASE_HOST -p $TFIO_DEMO_DATABASE_PORT -U $TFIO_DEMO_DATABASE_USER -d $TFIO_DEMO_DATABASE_NAME -f AirQualityUCI.sql"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "acEST3amdyDI"
      },
      "source": [
        "### 从 PostgreSQL 服务器创建数据集并在 TensorFlow 中使用\n",
        "\n",
        "从 PostgreSQL 服务器创建数据集就像调用带 `query` 和 `endpoint` 参数的 `tfio.experimental.IODataset.from_sql` 一样简单。`query` 是对表中选定列的 SQL 查询，而 `endpoint` 参数是地址和数据库名称："
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "h21RdP7meGzP"
      },
      "outputs": [],
      "source": [
        "import os\n",
        "import tensorflow_io as tfio\n",
        "\n",
        "endpoint=\"postgresql://{}:{}@{}?port={}&amp;dbname={}\".format(\n",
        "    os.environ['TFIO_DEMO_DATABASE_USER'],\n",
        "    os.environ['TFIO_DEMO_DATABASE_PASS'],\n",
        "    os.environ['TFIO_DEMO_DATABASE_HOST'],\n",
        "    os.environ['TFIO_DEMO_DATABASE_PORT'],\n",
        "    os.environ['TFIO_DEMO_DATABASE_NAME'],\n",
        ")\n",
        "\n",
        "dataset = tfio.experimental.IODataset.from_sql(\n",
        "    query=\"SELECT co, pt08s1 FROM AirQualityUCI;\",\n",
        "    endpoint=endpoint)\n",
        "\n",
        "print(dataset.element_spec)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "8y-VpwcWNYTF"
      },
      "source": [
        "从上面的 `dataset.element_spec` 的输出中可以看到，创建的 `Dataset` 的元素是一个 Python 字典对象，将数据库表的列名称作为键：\n",
        "\n",
        "```\n",
        "{   'co': TensorSpec(shape=(), dtype=tf.float32, name=None),   'pt08s1': TensorSpec(shape=(), dtype=tf.int32, name=None), }\n",
        "```\n",
        "\n",
        "应用进一步的运算非常方便。例如，您可以选择 `Dataset` 的 `nox` 和 `no2` 字段，并计算差值："
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "qCoueXYZOvqZ"
      },
      "outputs": [],
      "source": [
        "dataset = tfio.experimental.IODataset.from_sql(\n",
        "    query=\"SELECT nox, no2 FROM AirQualityUCI;\",\n",
        "    endpoint=endpoint)\n",
        "\n",
        "dataset = dataset.map(lambda e: (e['nox'] - e['no2']))\n",
        "\n",
        "# check only the first 20 record\n",
        "dataset = dataset.take(20)\n",
        "\n",
        "print(\"NOx - NO2:\")\n",
        "for difference in dataset:\n",
        "  print(difference.numpy())"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "xO2pheWEPQSU"
      },
      "source": [
        "现在，可以将创建的 `Dataset` 直接传递至 `tf.keras` 用于训练或推理了。"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "AzQoxSlHVv1k"
      },
      "source": [
        "## 参考文献\n",
        "\n",
        "- Dua, D. and Graff, C. (2019). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science.\n",
        "- S. De Vito, E. Massera, M. Piga, L. Martinotto, G. Di Francia, On field calibration of an electronic nose for benzene estimation in an urban pollution monitoring scenario, Sensors and Actuators B: Chemical, Volume 129, Issue 2, 22 February 2008, Pages 750-757, ISSN 0925-4005"
      ]
    }
  ],
  "metadata": {
    "colab": {
      "collapsed_sections": [
        "Tce3stUlHN0L"
      ],
      "name": "postgresql.ipynb",
      "toc_visible": true
    },
    "kernelspec": {
      "display_name": "Python 3",
      "name": "python3"
    }
  },
  "nbformat": 4,
  "nbformat_minor": 0
}
